Multi-view emotional expressions dataset using 2D pose estimation

Human body expressions convey emotional shifts and intentions of action and, in some cases, are even more effective than other emotion modalities. Although many body expression datasets based on motion capture are available, widely distributed datasets of naturalized body expressions based on 2D video are lacking. In this paper, therefore, we report the multi-view emotional expressions dataset (MEED) using 2D pose estimation. Twenty-two actors presented six emotional (anger, disgust, fear, happiness, sadness, surprise) and neutral body movements from three viewpoints (left, front, right). A total of 4102 videos were captured. The MEED consists of the corresponding pose estimation results (i.e., 397,809 PNG files and 397,809 JSON files). The size of MEED exceeds 150 GB. We believe this dataset will benefit research in various fields, including affective computing, human-computer interaction, social neuroscience, and psychiatry.


Background & Summary
It is widely accepted that emotion is communicated via multiple modalities involving both verbal and non-verbal aspects, such as tone, eye movement, facial expression, and body language. Recent studies have demonstrated that body movements can effectively reflect changes in affective state 1, even among primates 2. People pay more attention to body expressions than to facial expressions or voices when the affective information is of high intensity 3, when perception is ambiguous 4, or when information from these channels is incongruent 5,6. As a growing number of psychological studies have indicated the significant role of body movement in transmitting information and emotional states [7][8][9], artificial intelligence for emotion recognition is shifting from facial expression systems 10 or body expression systems 11,12 to multi-channel information combination 13.
Studies of body movement cover several domains, including gait analysis 14, body posture analysis, and gesture analysis. One focus is the kinematic information of body movement, such as velocity, acceleration, trajectory, and posture, which cannot be accurately and effectively represented by static pictures or verbal descriptions. In recent decades, motion capture technology has made it possible to precisely capture and analyze the kinematic data of each joint [15][16][17][18]. A variety of stimulus sets have emerged, including point-light displays 19,20, video clips 21, images 22, and virtual agents 14,23,24. The study of body movement has gradually shifted from conceptual research to data-based quantitative research.
However, kinematic information from 2D video is also essential for studying emotional body movements, because people do not customarily wear sensors in daily life as they do in laboratory settings. Fortunately, many pose estimation projects, such as AlphaPose 25, Pose Tensorflow 26,27, OpenPose 28, and DeepLabCut 29,30, use machine learning to estimate the posture of persons or animals in videos or pictures and obtain various data, such as the coordinates of joints. They have been applied in some studies in the field of social neuroscience [31][32][33][34][35]. For example, de Gelder and Poyo Solanas proposed the radically distributed model 36, which suggests an additional mid-level feature analysis between low-level feature and high-level concept analyses. The mid-level features, namely kinematic features (e.g., velocity, acceleration, vertical movement) and postural features (e.g., limb angle, limb contraction, symmetry, surface, shoulder ratio), have a specific mapping with the brain. Poyo Solanas, Vaessen, and de Gelder found that the extra-striate body area and fusiform body area exhibit more sensitivity towards postural features than kinematic features 37. Therefore, we report a larger and standardized dataset with various emotions: the multi-view emotional expressions dataset (MEED). MEED contains 4102 recordings of six emotional (anger, disgust, fear, happiness, sadness, surprise) and neutral body movements from three views (left, front, right). Each recording consists of the frames extracted by OpenPose and the pixel-space coordinates of 25 body joints in each frame. MEED is freely available. We hope to encourage researchers in multiple fields (e.g., affective computing, human-computer interaction, artificial intelligence, social security, and social neuroscience) to fully explore the various features of emotional body movements in daily life, and to promote interdisciplinary research across these fields.

Methods
Preparation phase. Twenty-four college students with acting experience from Dalian University of Technology were recruited and paid for their participation. All participants signed an informed consent form and were aware that their recordings would be shared publicly. Two actors dropped out, leaving 22 actors (19-24 years old, mean = 20.6 years) included in the MEED. This study was approved by the Human Research Institutional Review Board of Liaoning Normal University and followed the Declaration of Helsinki (1991).
Thirty-five standardized daily event scenarios (five for each emotion and neutral) with high recognition accuracy (82.9%-100%, mean = 93.4%) were created to guide the actors in the recording phase. The specific content and validation of these scenarios and performances are described in our previous work 20,24.
Three Microsoft Kinect 2.0 cameras, recording at a frame rate of 15 fps, were placed at the front, left, and right of a 1 m × 1 m stage, 1.05 m above the floor and 2.5 m from the center of the stage, and were controlled by a laptop computer (Microsoft Surface Pro 4). More details can be found in our previous work 38.
Recording phase. Actors, wearing black tights, performed for six seconds according to the randomly presented scenario, and several performances were selectively repeated to guarantee robustness. Actors were asked to face the center camera, standing naturally with arms hanging down. All three cameras started recording simultaneously after the actor indicated that he or she was ready. The recording phase took approximately two hours, during which the actors could rest at any time.
Pose estimation. OpenPose (v1.7.0), an advanced, reliable skeleton-extraction library 28, was used for pose estimation of each video. Each video has 97 image frames (see Fig. 2), although some videos have slightly fewer. The horizontal and vertical coordinates (x, y) of 25 keypoints in the pixel space of each frame of each video, as well as the confidence level of each estimated joint position, were obtained through pose estimation. The pose estimation results take two forms: images and data files of joint positions. Each image file is composed of the image frame, the skeletal joints, and the 25 keypoints (see Fig. 1). For each recording, the information in the image files was digitized into one data file per frame.
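As an illustration of how one per-frame data file might be read, the following minimal Python sketch parses a frame under the assumption that the MEED JSON files follow the standard OpenPose output layout, in which `people[0]["pose_keypoints_2d"]` is a flat list of (x, y, confidence) triplets. The function name `parse_openpose_frame` and the example frame are hypothetical and are not part of the dataset or of the authors' MATLAB code.

```python
import json

N_KEYPOINTS = 25  # number of body keypoints per frame, as stated in the text


def parse_openpose_frame(json_text):
    """Return a list of 25 (x, y, confidence) tuples for the first person.

    Assumption: the file uses the standard OpenPose layout, where
    people[0]["pose_keypoints_2d"] is [x0, y0, c0, x1, y1, c1, ...].
    """
    data = json.loads(json_text)
    people = data.get("people", [])
    if not people:
        # No person detected: report every keypoint as unrecognized.
        return [(0.0, 0.0, 0.0)] * N_KEYPOINTS
    flat = people[0]["pose_keypoints_2d"]
    if len(flat) != 3 * N_KEYPOINTS:
        raise ValueError(f"expected {3 * N_KEYPOINTS} values, got {len(flat)}")
    # Group the flat list into (x, y, confidence) triplets.
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


# Synthetic example frame (values are made up, not taken from MEED).
example = json.dumps({
    "version": 1.3,
    "people": [{"pose_keypoints_2d": [float(v) for v in range(75)]}],
})
keypoints = parse_openpose_frame(example)
print(len(keypoints), keypoints[1])
```

In practice the JSON text would be read from one of the per-frame files; the exact file naming and directory layout should be taken from the dataset itself.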

Technical Validation
Proportion of unrecognized keypoints. The effectiveness of OpenPose in extracting coordinates depends on various factors, such as the velocity of the actor's movement, the frame rate, and physical occlusion. A high velocity may cause blurring in some frames and deviations in the position of keypoints. Additionally, limb occlusion lowers the confidence level of the estimated joint positions, and long-term physical occlusion may leave subsequent joints unrecognized because prior information is lacking. Consequently, the coordinates of unrecognized keypoints in some frames appear as (0, 0). We take the proportion of these unrecognized keypoints among all keypoints in all frames of each recording as one quality metric for the 2D pose estimation dataset, called the proportion of unrecognized keypoints (PUK), which is defined as
$$\mathrm{PUK} = \frac{N_{(0,0)}}{N_{keypoints} \times N_{frame}}$$
where N_{(0,0)} is the total number of unrecognized keypoints in all frames of each recording, N_{keypoints} is the number of body keypoints per frame (25), and N_{frame} is the total number of frames of each recording. The results showed that the PUK was lowest in the frontal view, with mean values ranging from 0.003 to 0.048 under all conditions (see Table 2 and Fig. 3).
Confidence level. OpenPose uses confidence maps, which are created from the annotated keypoints, to assess the predicted data 28. Every confidence map is a 2D representation of the probability that a body part appears at each pixel location. Each map contains a Gaussian-distributed region centered on the keypoint. The Gaussian center has a maximum confidence of 1, and the confidence decreases with distance from the center. In other words, each pixel position in the confidence map has a corresponding confidence value. The number of confidence peaks equals the number of people in the picture being predicted. MEED contains only single-person situations, so there is only one peak per confidence map. The peak of each confidence map predicted by the network is located through a non-maximum suppression algorithm.
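This confidence peak is expressed in the pose estimation results as the confidence level (CL) attached to each keypoint estimation. Therefore, we regard the mean CL of the 25 keypoints within each recording as the second quality metric for this dataset, which is defined as
$$\mathrm{CL} = \frac{\sum_{f=1}^{N_{frame}} \sum_{k=1}^{N_{keypoints}} c_{f,k}}{N_{keypoints} \times N_{frame}}$$
where c_{f,k} is the confidence level of keypoint k in frame f. The CL results are reported in Table 3 and Fig. 4. The results of the two quality metrics suggest that the pose estimation results are good enough for further analysis.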
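The two quality metrics can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `puk` and `mean_cl` are hypothetical, each frame is assumed to be a list of (x, y, confidence) tuples, and unrecognized keypoints are assumed to count toward the mean CL with their stored confidence value.

```python
def puk(frames):
    """Proportion of unrecognized keypoints in one recording.

    frames: list of frames, each a list of (x, y, confidence) tuples.
    A keypoint counts as unrecognized when its coordinates are (0, 0).
    """
    total = sum(len(frame) for frame in frames)  # N_keypoints * N_frame
    missing = sum(
        1 for frame in frames for (x, y, _c) in frame if x == 0 and y == 0
    )
    return missing / total


def mean_cl(frames):
    """Mean confidence level over all keypoints in all frames.

    Assumption: unrecognized keypoints are included in the average.
    """
    total = sum(len(frame) for frame in frames)
    return sum(c for frame in frames for (_x, _y, c) in frame) / total


# Synthetic recording with two frames of 25 keypoints each (made-up values).
frame_a = [(100.0, 200.0, 0.8)] * 25
frame_b = [(0.0, 0.0, 0.0)] * 5 + [(100.0, 200.0, 0.8)] * 20
recording = [frame_a, frame_b]

print(puk(recording))      # 5 unrecognized keypoints out of 50
print(mean_cl(recording))  # 45 keypoints at 0.8, 5 at 0.0, averaged over 50
```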
Performance reliability. To assess the reliability of the performances, that is, whether all 22 actors expressed the instructed emotions equally well, we examined the consistency of the objective movement value across actors. The objective movement of the frontal-view recordings in MEED was quantified using customized MATLAB code 40,41, and prior research has demonstrated that this movement correlates positively with the intensity of emotion and with the motion that observers can perceive from the human body 20,42,43. Specifically, if a pixel showed a luminance change of more than 10 units between two consecutive frames, it was counted as a pixel motion. The objective movement values were obtained by averaging the number of pixel motions per frame within each video and were then saved in frontMovement.csv. We then conducted a reliability analysis of the objective movement value for each emotional and neutral condition across all actors using SPSS 26.0 (https://www.ibm.com/products/spss-statistics). The results showed that the Cronbach's alpha coefficient was high under all emotional and neutral conditions (anger = 0.900, disgust = 0.939, fear = 0.919, happiness = 0.875, sadness = 0.929, surprise = 0.927, and neutral = 0.974), suggesting high reliability of these performances and indicating that all actors in MEED expressed these emotions and scenarios comparably well.
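The two steps of this analysis can be illustrated with a short Python sketch. It is not the authors' MATLAB code or the SPSS procedure: the function names are hypothetical, frames are assumed to be 2D lists of luminance values, and the arrangement of the score matrix (rows as cases, columns as actors treated as items) is an assumption about how the reliability analysis was set up.

```python
from statistics import variance


def pixel_motion_counts(frames, threshold=10):
    """Count pixels whose luminance changes by more than `threshold`
    between each pair of consecutive frames (2D lists of luminance)."""
    counts = []
    for prev, curr in zip(frames, frames[1:]):
        count = sum(
            1
            for row_prev, row_curr in zip(prev, curr)
            for a, b in zip(row_prev, row_curr)
            if abs(a - b) > threshold
        )
        counts.append(count)
    return counts


def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix with rows = cases, columns = items.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(row totals))
    """
    k = len(scores[0])
    item_variances = sum(variance(column) for column in zip(*scores))
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_variances / total_variance)


# Synthetic 2 x 2 luminance frames (made-up values).
video = [
    [[0, 0], [0, 0]],
    [[0, 20], [5, 11]],
    [[0, 20], [5, 11]],
]
counts = pixel_motion_counts(video)
objective_movement = sum(counts) / len(counts)  # average pixel motions
print(counts, objective_movement)

# Synthetic scores: three cases rated identically in shape by three items.
alpha = cronbach_alpha([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(alpha)
```

The sample variance (n - 1 denominator) is used for both item and total variances, which is the usual convention for Cronbach's alpha.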

Usage Notes
MEED is an openly available dataset that stores the results of 2D pose estimation for six emotions and neutral expressions from three views. JSON and MAT files can be easily used by data processing software such as MATLAB (https://ww2.mathworks.cn/en/products/matlab.html), R (https://www.r-project.org), and Python (https://www.python.org). For example, the coordinate data can be analyzed using representational similarity analysis 44 to examine the association between kinematic features and postural features of body expressions, and using a decision tree classifier 45 to assess the relative importance of these features and body parts 46.
Moreover, the unrecognized coordinates must be corrected if users want to include them in their analyses. We suggest that users treat the coordinates of each keypoint across the frames of a video as a time series and apply interpolation correction, such as linear, polynomial, or spline interpolation. Given that linear interpolation is limited to the case of non-continuous unrecognized keypoints, we suggest curve fitting instead, for example with the Curve Fitting Toolbox in MATLAB (https://ww2.mathworks.cn/products/curvefitting.html) or its built-in functions (spline, makima, pchip). We also recommend using Photoshop (https://www.adobe.com/products/photoshop.html) for PNG correction when necessary.
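As a minimal illustration of the simplest option, the Python sketch below fills unrecognized samples of one keypoint coordinate by linear interpolation along the time axis. The function name `fill_gaps_linear` is hypothetical, and copying the nearest valid value at the start or end of a track is an assumption of this sketch. For longer runs of consecutive unrecognized frames, a shape-preserving interpolant (for example `scipy.interpolate.PchipInterpolator`, which is analogous to MATLAB's pchip) may be a better choice, as suggested above.

```python
def fill_gaps_linear(values, is_missing):
    """Linearly interpolate missing samples in a 1D time series.

    values: coordinate of one keypoint over consecutive frames.
    is_missing: booleans marking unrecognized frames.
    Gaps at the start or end copy the nearest valid value (assumption).
    """
    valid = [i for i, missing in enumerate(is_missing) if not missing]
    if not valid:
        raise ValueError("no valid samples to interpolate from")
    filled = list(values)
    for i, missing in enumerate(is_missing):
        if not missing:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            # Weight by the position of frame i between the two neighbors.
            t = (i - left) / (right - left)
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled


# Synthetic track of one keypoint (made-up values); (0, 0) marks a miss.
xs = [10.0, 0.0, 0.0, 16.0, 18.0]
ys = [5.0, 0.0, 0.0, 5.0, 5.0]
missing = [x == 0 and y == 0 for x, y in zip(xs, ys)]
xs_filled = fill_gaps_linear(xs, missing)
ys_filled = fill_gaps_linear(ys, missing)
print(xs_filled, ys_filled)
```

The x and y tracks share one missing-frame mask so that a keypoint is either kept or replaced as a whole.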
MEED is applicable in multiple fields, such as the affective computing of body expressions and corresponding brain mechanisms 37,46 in social neuroscience. Researchers in human-computer interaction, machine learning, sports motion analysis, psychiatry, and social security will also be interested in this dataset. We hope that MEED will be of further assistance to them.

Code Availability
The MATLAB code for parsing the JSON files and processing the coordinates can be found at https://doi.org/10.5281/zenodo.8185369.